Thumbnail Summarization Techniques for Web Archives
Authors
Abstract
Thumbnails of archived web pages as they appear in common browsers such as Firefox or Chrome can be useful to convey the nature of a web page and how it has changed over time. However, creating thumbnails for all archived web pages is not feasible for large collections, both in terms of time to create the thumbnails and space to store them. Furthermore, at least for the purposes of initial exploration and collection understanding, people will likely only need a few dozen thumbnails and not thousands. In this paper, we develop different algorithms to optimize the thumbnail creation procedure for web archives based on information retrieval techniques. We study different features based on HTML text that correlate with changes in rendered thumbnails so we can know in advance which archived pages to use for thumbnails. We find that SimHash correlates with changes in the thumbnails (ρ = 0.59, p < 0.005). We propose different algorithms for thumbnail creation suitable for different applications, reducing the number of thumbnails to be generated to 9% – 27% of the total size.
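The idea of using an HTML-text feature such as SimHash to decide in advance which archived pages deserve a thumbnail can be illustrated with a minimal sketch. This is not the paper's algorithm, which the abstract does not specify. The tokenization, the 64-bit fingerprint built from MD5, the function names (`simhash`, `hamming_distance`, `select_for_thumbnails`) and the `threshold` value are all assumptions made for illustration.

```python
import hashlib
import re


def simhash(text, bits=64):
    """Compute a SimHash fingerprint over the word tokens of `text`.

    Assumption: tokens are lowercase word runs and each token is hashed
    with MD5 truncated to `bits` bits (bits must be a multiple of 8, <= 128).
    """
    weights = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (token_hash >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def select_for_thumbnails(pages, threshold=8):
    """Return indices of pages to render as thumbnails.

    Hypothetical selection rule: walk the archived copies in time order
    and keep a page only when its fingerprint differs from the last kept
    page by more than `threshold` bits.
    """
    selected = []
    last_fingerprint = None
    for index, html in enumerate(pages):
        fingerprint = simhash(html)
        if last_fingerprint is None or hamming_distance(fingerprint, last_fingerprint) > threshold:
            selected.append(index)
            last_fingerprint = fingerprint
    return selected
```

Calling `select_for_thumbnails(pages)` on a time-ordered list of HTML strings returns the positions of the copies that would be rendered. A lower `threshold` keeps more thumbnails, a higher one keeps fewer. A suitable value would have to be tuned against real collections.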
Similar resources
Visualizing Digital Collections of Web Archives
An important problem in web archiving is understanding and presenting how a single page changes over time. This is not only important for researchers, but can also be useful in educating the general public about the temporal and dynamic nature of the web. A common method for presenting webpage change is to display a set of thumbnails of the mementos, or archived pages. Although this can be a us...
Getting Information from Documents You Cannot Read: An Interactive Cross-Language Text Retrieval and Summarization System
In this paper we discuss research designed to investigate the ability of users to find information in texts written in languages unknown to them. One study shows how document thumbnail visualizations can be used effectively to choose potentially relevant documents. Another study shows how a user of a cross-language text retrieval system who has no foreign language knowledge can nevertheless c...
Keizai: An Interactive Cross-Language Text Retrieval System
Can we expect people to be able to get information from texts in languages they cannot read? In this paper we review two relevant lines of research bearing on this question and will show how our results are being used in the design of a new Web interface for cross-language text retrieval. One line of research, “Interactive IR”, is concerned with the user interface issues for information retriev...
Text Summarization Using Cuckoo Search Optimization Algorithm
Today, with the rapid growth of the World Wide Web and the creation of Internet sites and online text resources, text summarization has attracted considerable attention from researchers. Extractive text summarization is an important summarization method that consists of selecting the most representative sentences from the input document. When we are facing large-volume documents, the extr...
Improving Cross-Language Text Retrieval with Human Interactions
Can we expect people to be able to get information from texts in languages they cannot read? In this paper we review two relevant lines of research bearing on this question and will show how our results are being used in the design of a new Web interface for cross-language text retrieval. One line of research, “Interactive IR”, is concerned with the user interface issues for information retriev...